In the rapidly evolving landscape of financial technology, credit scoring remains a cornerstone in determining the creditworthiness of applicants. This paper delves into the realm of predictive modeling, using a dataset provided by our professors, enriched with comprehensive credit scoring information. The overarching objective of our study is to train a model that efficiently and accurately determines the eligibility of individuals for credit approval.
The task set forth by our professors involves a comprehensive journey through four key exercises. The first exercise is centered around a thorough analysis of the dataset. Here, we delve into understanding the underlying structures, identifying critical variables, and most importantly, balancing the dataset to ensure a fair representation of diverse credit scenarios.
Subsequently, in the second exercise, we focus on training and testing a logistic classifier. This step is crucial in establishing a baseline model from which further improvements can be measured.
The third exercise involves enhancing the predictive performance of the model. This stage is particularly challenging and significant, as it entails fine-tuning the model to ensure it captures the nuances of credit scoring with greater accuracy and efficiency.
Finally, the fourth exercise invites us to step into the practical world. Here, we explore the various challenges a company may face if they were to implement our model in a real-world scenario. This exercise not only grounds our theoretical work in reality but also provides valuable insights into the practical implications and considerations in deploying machine learning models in the credit industry.
This paper aims to not only present a robust model for credit scoring but also to contribute to the broader understanding of how machine learning can be effectively utilized in financial decision-making processes.
str(loan_sample)
## spc_tbl_ [40,000 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ loan_amnt : num [1:40000] 10000 16000 6000 9000 24000 8000 3600 5550 2000 10000 ...
## $ int_rate : num [1:40000] 11.47 9.49 17.99 9.75 12.59 ...
## $ grade : chr [1:40000] "B" "B" "D" "B" ...
## $ home_ownership : chr [1:40000] "RENT" "MORTGAGE" "RENT" "MORTGAGE" ...
## $ annual_inc : num [1:40000] 35000 110000 40000 54000 66000 ...
## $ verification_status: chr [1:40000] "Verified" "Not Verified" "Verified" "Verified" ...
## $ purpose : chr [1:40000] "debt_consolidation" "debt_consolidation" "home_improvement" "car" ...
## $ dti : num [1:40000] 22.05 15.67 17.34 3.58 12.25 ...
## $ open_acc : num [1:40000] 15 9 8 11 8 10 8 5 8 14 ...
## $ revol_bal : num [1:40000] 10211 10068 3755 1459 29656 ...
## $ revol_util : num [1:40000] 31.5 32.7 25.5 22.8 70.6 34.3 36.1 66.1 24.6 75.6 ...
## $ total_acc : num [1:40000] 17 21 19 34 21 17 9 12 30 21 ...
## $ total_rec_int : num [1:40000] 1574 2364 943 436 2778 ...
## $ application_type : chr [1:40000] "Individual" "Individual" "Individual" "Joint App" ...
## $ tot_cur_bal : num [1:40000] 17440 35104 19995 8713 29656 ...
## $ total_rev_hi_lim : num [1:40000] 32400 30800 14700 6400 42000 28200 21400 6200 17500 30300 ...
## $ Status : num [1:40000] 0 0 0 0 0 0 1 0 0 0 ...
## - attr(*, "spec")=
## .. cols(
## .. loan_amnt = col_double(),
## .. int_rate = col_double(),
## .. grade = col_character(),
## .. home_ownership = col_character(),
## .. annual_inc = col_double(),
## .. verification_status = col_character(),
## .. purpose = col_character(),
## .. dti = col_double(),
## .. open_acc = col_double(),
## .. revol_bal = col_double(),
## .. revol_util = col_double(),
## .. total_acc = col_double(),
## .. total_rec_int = col_double(),
## .. application_type = col_character(),
## .. tot_cur_bal = col_double(),
## .. total_rev_hi_lim = col_double(),
## .. Status = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Checking for NAs
any(is.na(loan_sample))
## [1] FALSE
The result of “FALSE” from the check indicates that there are no missing values (NAs) in the “loan_sample” dataset.
# First we have to check whether all variables have been imported in the correct format.
first_overview <- overview(loan_sample)
plot(first_overview)
It is clearly visible that the columns with the data type character
still have to be converted to factors. In addition, the “Status” column
is currently still a numeric data type. This must also be converted to
factor, as it is a binary variable.
loan_sample <- loan_sample %>%
mutate_if(is.character, as.factor) %>% # Convert all character columns to factors
mutate(Status = as.factor(Status)) # Convert 'Status' column to factor
# Plot to see, if the datatypes are correct
second_overview <- overview(loan_sample)
plot(second_overview)
That looks right, now we can count the number of variables.
# Count numeric variables in 'loan_sample'
numeric_vars_count <- sum(sapply(loan_sample, is.numeric))
# Count categorical (factor) variables in 'loan_sample'
categorical_vars_count <- sum(sapply(loan_sample, is.factor))
# Print counts of numeric and categorical variables
cat("There are", numeric_vars_count, "numeric variables and", categorical_vars_count, "categorical variables in the dataset.")
## There are 11 numeric variables and 6 categorical variables in the dataset.
categorical_levels <- sapply(loan_sample, function(x) if(is.factor(x)) length(unique(x)) else NA)
# Find the name of the categorical variable with the most levels
cat_var_most_levels <- names(which.max(categorical_levels))
# Find the number of levels for this variable
levels_count <- max(categorical_levels, na.rm = TRUE)
# Print the variable name and the number of levels
cat("The categorical variable with the most levels in the dataset is:", cat_var_most_levels, "with", levels_count, "levels.")
## The categorical variable with the most levels in the dataset is: purpose with 13 levels.
# Categorial Label with the most Levels
summary(loan_sample)
## loan_amnt int_rate grade home_ownership annual_inc
## Min. : 1000 Min. : 5.31 A: 7329 MORTGAGE:17701 Min. : 5000
## 1st Qu.: 7000 1st Qu.: 9.44 B:13166 OWN : 3981 1st Qu.: 42000
## Median :10075 Median :12.29 C:11842 RENT :18318 Median : 57000
## Mean :11687 Mean :12.62 D: 7663 Mean : 63400
## 3rd Qu.:15000 3rd Qu.:15.05 3rd Qu.: 77000
## Max. :40000 Max. :27.49 Max. :400000
##
## verification_status purpose dti
## Not Verified :14373 debt_consolidation:23342 Min. : 0.00
## Source Verified:16116 credit_card : 9362 1st Qu.:12.13
## Verified : 9511 other : 2337 Median :17.60
## home_improvement : 2099 Mean :18.22
## major_purchase : 794 3rd Qu.:23.86
## medical : 444 Max. :60.14
## (Other) : 1622
## open_acc revol_bal revol_util total_acc
## Min. : 1.00 Min. : 0 Min. : 0.00 Min. : 3.00
## 1st Qu.: 8.00 1st Qu.: 5615 1st Qu.: 34.80 1st Qu.:15.00
## Median :10.00 Median : 9818 Median : 52.40 Median :20.00
## Mean :10.31 Mean :11995 Mean : 52.17 Mean :21.27
## 3rd Qu.:13.00 3rd Qu.:15832 3rd Qu.: 70.00 3rd Qu.:27.00
## Max. :23.00 Max. :78762 Max. :121.40 Max. :57.00
##
## total_rec_int application_type tot_cur_bal total_rev_hi_lim
## Min. : 0.0 Individual:39450 Min. : 0 Min. : 300
## 1st Qu.: 673.2 Joint App : 550 1st Qu.: 25136 1st Qu.: 13000
## Median :1342.5 Median : 53722 Median : 20900
## Mean :1818.5 Mean : 99100 Mean : 24194
## 3rd Qu.:2432.9 3rd Qu.:157834 3rd Qu.: 32200
## Max. :8834.9 Max. :472573 Max. :100000
##
## Status
## 0:34810
## 1: 5190
##
##
##
##
##
Our loan data includes information on 40,000 loans. The average loan size is about $11,687, but this number can be as low as $1,000 or as high as $40,000, showing that loan amounts are different for different people. The average interest rate on a loan is 12.62%, but some people get lower rates like 5.31%, and others get higher rates up to 27.49%. This shows there are many different interest rates that people are getting. People who have taken out loans make an average of $63,400 a year, but there is a big difference in how much money people make. Some make more, and some make less. The debt-to-income ratio (DTI) tells us how much debt people have compared to their income. On average, this number is 18.22% for the loans we looked at. Looking at how many credit lines people have open, the average is 10, but in total, they might have had up to 21 over time. This tells us about how many loans or credit cards people might be using or have used before. The average amount people owe on their credit lines is $11,995, and they use about half of the credit available to them. This tells us about how much of their available loan money they are using. People also pay about $1,818 in interest on average. This is extra money they pay in addition to the loan they took out. When we look at all the loans and credit people have, the total amount they owe on average is $99,100. This shows that people have a lot of loans or owe a lot of money on their credit lines. Finally, the average limit people have on their credit lines is $24,194. This is how much the bank or loan company lets them borrow in total.
# Plot the distribution of the target variable
ggplot(loan_sample, aes(x = Status, fill = Status)) +
geom_bar() +
scale_fill_brewer(palette = "Set1") +
labs(title = 'Distribution of target variable (Status)', x = 'Status', y = 'Count') +
theme_minimal()
The target variable ‘Status’ in our dataset distinctly categorizes
credit approval outcomes into two levels: ‘0’ represents a ‘No’,
indicating that the credit has not been approved, whereas ‘1’ signifies
a ‘Yes’, confirming the approval of the credit.
The target variable in our dataset shows a significant imbalance, with a substantially higher number of credits not being approved. This imbalance is important to address because, in machine learning, an unbalanced dataset can lead to biased models that over-predict the majority class, in this case, the non-approved credits. Consequently, balancing the dataset is crucial for training an accurate and reliable logistic classifier that fairly represents both outcomes.
1.1.5: Check the distribution of the numeric variables in the data set (include different visual representations).
# Histograms
loan_sample %>%
select_if(is.numeric) %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~variable, scales = 'free_x') +
theme_minimal()
The histogram we created for the variable open_acc has gaps. We should examine this more closely.
# Create boxplot of the variable total_acc
ggplot(loan_sample, aes(x = open_acc)) +
geom_histogram(bins = 22) + # Adjust the number of bins as necessary
labs(x = "Open Accounts", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Since the variable ranges from 1 to 22, using 30 bins leads to empty
spaces. Adjusting the bins to match the actual value range has provided
a clearer, gap-free histogram.
# Boxplots
loan_sample %>%
select_if(is.numeric) %>%
gather(key = "variable", value = "value") %>%
ggplot(aes(y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = 'free') +
theme_minimal()
Elaborate your view on how to proceed in dealing with the outliers and – if necessary – take appropriate action.
# Reshape the data to long format and scale the numeric values
loan_sample_long_scaled <- loan_sample %>%
select(where(is.numeric)) %>%
mutate(across(everything(), scale)) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value")
# Create a boxplot with the scaled values
ggplot(loan_sample_long_scaled, aes(x = variable, y = value)) +
geom_boxplot() +
labs(x = "", y = "Scaled Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
The box plot illustrated the the distribution of scaled numeric variables. Outliers are visible as individual points outside the box plot. It is visible that most of the variables have some outliers.
When considering how to deal with outliers, it is important to understand their nature and the potential impact on the analysis. If outliers represent errors or anomalies that are not characteristic of the population, they can be removed or corrected.
# Visualizing all variables with and without the outliers
loan_sample %>%
plot_outlier(diagnose_outlier(loan_sample) %>%
filter(outliers_ratio >= 1.5) %>%
select(variables) %>%
unlist())
# Function to check if a variable should be truncated
should_keep <- function(x, lower_perc, upper_perc) {
lower_bound <- quantile(x, lower_perc, na.rm = TRUE)
upper_bound <- quantile(x, upper_perc, na.rm = TRUE)
x >= lower_bound & x <= upper_bound
}
# Apply truncation and keep rows where all numeric variables are within the specified range
loan_sample_truncated <- loan_sample %>%
filter(if_all(where(is.numeric), ~ should_keep(., 0.01, 0.99)))
loan_sample_balanced <- ovun.sample(Status ~ ., data=loan_sample_truncated, method = "under")
loan_sample_under <- data.frame(loan_sample_balanced[["data"]])
# Plot the balance
ggplot(loan_sample_under, aes(x = Status, fill = Status, group = Status)) +
geom_bar() +
ylab("Count") +
xlab("Status of the loan") +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
ggtitle("New balanced dataset")
Balancing a dataset is important for training classification models
because it prevents the model from being biased towards the majority
class and potentially ignoring the minority class. This is particularly
true for datasets where the outcome class of interest (e.g., default on
a loan) is much less common than the non-interest class (e.g.,
non-default). Balancing helps to improve the generalization of the model
and its performance on unseen data, especially for the minority
class.
(i.e. default vs non-default). Discuss the visualizations. Which variables seem to be relevant in predicting the target feature?
# Define columns to exclude (these are column with characters)
exclude_columns <- c(3, 4, 6, 7, 14)
# Loop through columns, excluding the specified ones
for (i in setdiff(1:ncol(loan_sample_under), exclude_columns)) {
# Only plot numeric columns
if (is.numeric(loan_sample_under[[i]])) {
print(ggplot(loan_sample_under, aes_string(y = names(loan_sample_under)[i], color = "Status")) +
geom_boxplot() +
ylab(names(loan_sample_under[i])) +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()))
}
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Hier noch text schreiben!
In the next step, we run the Boruta algorithm.The Boruta algorithm provides a more objective method for determining the importance of features as it is based on repeated, random comparisons rather than human judgment.
## 1. run of importance source...
## 2. run of importance source...
## 3. run of importance source...
## 4. run of importance source...
## 5. run of importance source...
## 6. run of importance source...
## 7. run of importance source...
## 8. run of importance source...
## 9. run of importance source...
## 10. run of importance source...
## 11. run of importance source...
## After 11 iterations, +17 secs:
## confirmed 11 attributes: annual_inc, dti, grade, home_ownership, int_rate and 6 more;
## rejected 2 attributes: application_type, purpose;
## still have 3 attributes left.
## 12. run of importance source...
## 13. run of importance source...
## 14. run of importance source...
## 15. run of importance source...
## After 15 iterations, +23 secs:
## confirmed 2 attributes: total_acc, verification_status;
## still have 1 attribute left.
## 16. run of importance source...
## 17. run of importance source...
## 18. run of importance source...
## 19. run of importance source...
## 20. run of importance source...
## 21. run of importance source...
## 22. run of importance source...
## After 22 iterations, +33 secs:
## confirmed 1 attribute: open_acc;
## no more attributes left.
## [1] "loan_amnt" "int_rate" "grade"
## [4] "home_ownership" "annual_inc" "verification_status"
## [7] "dti" "open_acc" "revol_bal"
## [10] "revol_util" "total_acc" "total_rec_int"
## [13] "tot_cur_bal" "total_rev_hi_lim"
Hier weiter machen Abishan!!!
Now, we can examine multicollinearity, which occurs when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This situation can result in skewed or misleading coefficient estimates and conclusions.
# Generate a bar plot for each categorical variable
categorical_vars <- c("home_ownership", "verification_status", "purpose", "addr_state", "application_type")
# Loop over categorical variables and plot using aes() and without aes_string()
for (cat_var in categorical_vars) {
# Check if the column exists to avoid errors
if (!cat_var %in% names(loan_sample_under)) {
message(paste("Skipping", cat_var, "as it is not found in the dataset."))
next
}
# Create the plot
plot <- loan_sample_under %>%
group_by(.data[[cat_var]], Status) %>%
summarise(Count = n(), .groups = 'drop') %>%
ggplot(aes(x = .data[[cat_var]], y = Count, fill = as.factor(Status))) +
geom_bar(stat = "identity", position = position_dodge()) +
labs(title = paste("Bar plot of", cat_var, "by Status"), x = cat_var, y = "Count") +
scale_fill_discrete(name = "Status") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) # Rotate x labels for readability
# Print the plot
print(plot)
}
## Skipping addr_state as it is not found in the dataset.
Hier Text schreiben!
library(plotly)
# Assuming your data is in a dataframe called loan_sample and
# the relevant columns are named loan_amnt and annual_inc
plot_ly(data = loan_sample_under, x = ~loan_amnt, y = ~annual_inc,
type = 'scatter', mode = 'markers',
marker = list(size = 10, opacity = 0.5)) %>%
layout(title = 'Association between Loan Amount and Annual Income',
xaxis = list(title = 'Loan Amount Requested'),
yaxis = list(title = 'Annual Income of Borrower'))
The areas of high density indicate common combinations of income and loan amounts, which could suggest standard loan products or typical borrower profiles. A wide range of incomes with a relatively narrow range of loan amounts might suggest that the loan amount is less sensitive to income past a certain threshold.
Hier Text schreiben !!! ### 2.1: Dividing the sample into training and testing set
# Split the data into training (70%) and testing (30%) sets
splitIndex <- createDataPartition(loan_sample_under$Status, p = 0.7, list = FALSE)
training_set <- loan_sample_under[splitIndex,]
testing_set <- loan_sample_under[-splitIndex,]
Hier Text schreiben!
PercTable(loan_sample_under$Status)
##
## freq perc
##
## 0 4'282 50.1%
## 1 4'270 49.9%
PercTable(training_set$Status)
##
## freq perc
##
## 0 2'998 50.1%
## 1 2'989 49.9%
PercTable(testing_set$Status)
##
## freq perc
##
## 0 1'284 50.1%
## 1 1'281 49.9%
In the next step, we train the logit model. In terms of our inputs i.e. our Xs, we use all variables included in the data_new_under apart from the status, which is our Y. How would you interpret the results printed from the summary fit1?
##
## Call:
## glm(formula = Status ~ ., family = binomial(), data = training_set)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.451e+00 4.004e-01 -3.624 0.000291 ***
## loan_amnt 4.205e-05 6.971e-06 6.031 1.63e-09 ***
## int_rate 1.326e-01 1.904e-02 6.962 3.36e-12 ***
## gradeB 2.144e-01 1.153e-01 1.859 0.063024 .
## gradeC 4.232e-01 1.572e-01 2.693 0.007086 **
## gradeD 3.834e-01 2.348e-01 1.633 0.102490
## home_ownershipOWN -1.699e-01 1.067e-01 -1.593 0.111208
## home_ownershipRENT 1.221e-01 7.694e-02 1.587 0.112436
## annual_inc -4.863e-06 1.405e-06 -3.462 0.000536 ***
## verification_statusSource Verified 6.119e-02 6.498e-02 0.942 0.346428
## verification_statusVerified 9.826e-02 7.560e-02 1.300 0.193699
## purposecredit_card -8.491e-01 3.171e-01 -2.678 0.007403 **
## purposedebt_consolidation -9.299e-01 3.140e-01 -2.962 0.003059 **
## purposehome_improvement -7.987e-01 3.358e-01 -2.378 0.017399 *
## purposehouse -1.009e+00 5.557e-01 -1.816 0.069385 .
## purposemajor_purchase -1.029e+00 3.750e-01 -2.744 0.006063 **
## purposemedical -8.102e-01 4.078e-01 -1.987 0.046966 *
## purposemoving -1.253e+00 5.102e-01 -2.457 0.014027 *
## purposeother -1.032e+00 3.330e-01 -3.097 0.001952 **
## purposerenewable_energy -9.322e-01 8.803e-01 -1.059 0.289613
## purposesmall_business -5.673e-01 4.322e-01 -1.313 0.189278
## purposevacation -6.323e-01 4.341e-01 -1.457 0.145247
## purposewedding -1.573e+00 1.007e+00 -1.561 0.118453
## dti 1.809e-02 4.142e-03 4.367 1.26e-05 ***
## open_acc 3.429e-02 1.057e-02 3.246 0.001172 **
## revol_bal -3.296e-06 9.783e-06 -0.337 0.736205
## revol_util 1.436e-03 2.462e-03 0.583 0.559685
## total_acc -6.939e-03 4.378e-03 -1.585 0.112966
## total_rec_int -2.398e-04 2.724e-05 -8.803 < 2e-16 ***
## application_typeJoint App -2.690e-02 2.312e-01 -0.116 0.907372
## tot_cur_bal -1.664e-06 4.423e-07 -3.762 0.000169 ***
## total_rev_hi_lim 1.739e-07 5.277e-06 0.033 0.973716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8299.7 on 5986 degrees of freedom
## Residual deviance: 7567.7 on 5955 degrees of freedom
## AIC: 7631.7
##
## Number of Fisher Scoring iterations: 4
We can print out only the significant variables with p-value lower than 0.05. We notice that 9 variables are found statistically significant.
## [1] "loan_amnt" "int_rate"
## [3] "gradeC" "annual_inc"
## [5] "purposecredit_card" "purposedebt_consolidation"
## [7] "purposehome_improvement" "purposemajor_purchase"
## [9] "purposemedical" "purposemoving"
## [11] "purposeother" "dti"
## [13] "open_acc" "total_rec_int"
## [15] "tot_cur_bal"
Next, we aim to evaluate the predictive performance of our model. To
do so, we will plot the ROC curve.
In the next step, we visualize the Precision/Recall Curve. This curve
summarizes the trade-off between the true positive rate and the positive
predictive value for a predictive model using different probability
thresholds.
fit1_precision <- performance(fit1_pred, measure = "prec", x.measure = "rec")
plot(fit1_precision, main="Fit1: Logit - Precision vs Recall")
# Extract the confusion matrix
cm <- confusionMatrix(as.factor(round(testing_set$fit1_score)), testing_set$Status)
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 821 443
## 1 463 838
##
## Accuracy : 0.6468
## 95% CI : (0.6279, 0.6653)
## No Information Rate : 0.5006
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.2936
##
## Mcnemar's Test P-Value : 0.5279
##
## Sensitivity : 0.6394
## Specificity : 0.6542
## Pos Pred Value : 0.6495
## Neg Pred Value : 0.6441
## Prevalence : 0.5006
## Detection Rate : 0.3201
## Detection Prevalence : 0.4928
## Balanced Accuracy : 0.6468
##
## 'Positive' Class : 0
##
– Report the AUC values and the overall accuracy and interpret the results.
# AUC
fit1_auc <- performance(fit1_pred, measure = "auc")
# Overall accuracy
accuracy <- sum(diag(cm$table)) / sum(cm$table)
# Print of the values
cat("AUC: ", fit1_auc@y.values[[1]]*100, "\nOverall Accuracy: ", accuracy)
## AUC: 69.59486
## Overall Accuracy: 0.6467836
Thinking about the pre-processing steps that you carried out before training the logistic classifier: ### 3.1: Can you think of a way to improve the predictive performance of your data?